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Abstract 

Mammalian genomes comprise many active and fossilized retroelements. The obligate requirement for retroelement 
integration affords host genomes an opportunity to 'domesticate' retroelement genes for their own purpose, leading to 
important innovations in genome defense and placentation. While many such exaptations involve retroviruses, the L1TD1 
gene is the only known domesticated gene whose protein-coding sequence is almost entirely derived from a LINE-1 (LI) 
retroelement. Human L1TD1 has been shown to play an important role in pluripotency maintenance. To investigate how 
this role was acquired, we traced the origin and evolution of LITDh We find that L1TD1 originated in the common ancestor 
of eutherian mammals, but was lost or pseudogenized multiple times during mammalian evolution. We also find that L1TD1 
has evolved under positive selection during primate and mouse evolution, and that one prosimian L1TD1 has 'replenished' 
itself with a more recent LI ORF1 from the prosimian genome. These data suggest that L1TD1 has been recurrently selected 
for functional novelty, perhaps for a role in genome defense. L1TD1 loss is associated with LI extinction in several megabat 
lineages, but not in sigmodontine rodents. We hypothesize that L1TD1 could have originally evolved for genome defense 
against LI elements. Later, L1TD1 may have become incorporated into pluripotency maintenance in some lineages. Our 
study highlights the role of retroelement gene domestication in fundamental aspects of mammalian biology, and that such 
domesticated genes can adopt different functions in different lineages. 
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Introduction 

Retroelements have profoundly shaped mammalian genomes 
over millions of years. Insertion of these selfish elements can 
lead to gene inactivation or changes in transcriptional profiles of 
neighboring genes. Moreover, the presence of large stretches of 
almost identical sequence distributed across the genome poses 
recombinational hazards, leading to chromosomal rearrange- 
ments, often with pathological consequences. With a few 
celebrated exceptions [1], the retrotransposition activity of 
retroelements is rarely beneficial [2]. However, individual 
transposition events can occasionally drive dramatic episodes 
of adaptation through the generation of genetic novelty [3]. 
Indeed, repetitive elements have often been co-opted as 
transcriptional regulatory elements like promoters, enhancers, 
and insulators [4]. 



One particularly striking class of such genetic novelty results 
from the 'domestication' or 'capture' of retroelement- and 
retrovirus-derived protein-coding genes by host genomes, a 
process by which these domesticated coding regions are exapted 
for a new function that is beneficial to the host [5,6]. Such 
instances are often recognized by the preservation of individual 
protein-coding frames despite the mutational decay of the rest of 
the parental retroelement. This extinction of the replication- 
competent retroelement but not its protein-coding gene implies 
that selection for the benefit of the host genome must have 
prevented the mutational attrition of that particular gene. 

Retroelements and retroviruses must integrate into the host 
genome as part of their replication cycle, presenting the host with a 
source of potentially advantageous protein-coding sequences. The 
syncytin genes of eutherian mammals represent some of the best 
characterized examples of such domestication events in which 
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Author Summary 

Transposable elements comprise major portions of most 
animal genomes and are selfish genetic elements that may 
encode proteins needed for their own spread to new 
genomic locations. Though often considered genomic 
parasites, these elements also occasionally create novel 
genes that prove beneficial to the host, a process called 
'domestication'. Here, we describe the evolution of a gene, 
L1TD1, which is derived from the protein-coding regions of 
the LI mobile element family. We show that L1TD1 was 
born in the common ancestor of placental mammals. 
L1TD1 expression in stem cells and its requirement to 
maintain the pluripotent state of human embryonic stem 
cells suggested it might have been originally domesticated 
for such a pluripotency role. We find that LITDl's evolution 
does not fit with the predictions of this model; in fact, 
L1TD1 has rapidly evolved in primates and mice and has 
been lost several times in mammals. We suggest an 
alternate model that L1TD1 was born as a means to defend 
genomes against transposable elements, perhaps LI itself. 
We propose that following this initial domestication, L1TD1 
later became incorporated into pluripotency programs in 
some mammalian lineages. 

envelope genes from ancient retroviruses have been preserved for 
their membrane-fusing and/ or immunosuppressive activities in the 
syncytiotrophoblast, the layer of the placenta which mediates 
maternal-fetal nutritional transfer [7-9] . In fact, loss of syncytin- A 
in mice is embryonic lethal, consistent with its indispensable role in 
placental function [10]. Syncytin genes thus represent a dramatic 
example of the maintenance and possibly invention of an essential 
function, placentation, via retroviral gene domestication. Impor- 
tandy, syncytin is not unique in this regard; similar domestications 
of sushi-ichi LTR retrotransposon protein-coding regions have 
also created multiple host genes (e.g., PeglO) involved in genomic 
imprinting and placentation [11,12]. 

Though syncytin represents a dramatic and beneficial genetic 
innovation, not all domesticated retroviral genes serve conserved, 
essential functions. The Fvl gene in mice represents a domesti- 
cated gag (viral capsid and nucleocapsid encoding) gene, which 
can actively interfere with the uncoating of incoming retroviral 
capsids [13,14]. As expected for this genome defense role, Fvl is 
under strong diversifying selection, presumably as a result of 
constant innovation required to recognize and block different 
retroviruses [15]. As a result of this diversifying selection to chase 
the sequence of incoming capsids, orthologous genes can have 
dramatically different antiviral specificities [16-19]. Fvl has also 
been lost or pseudogenized at least twice in the Mus genus [15], 
perhaps because retention of Fvl may depend on persistence of 
selection from incoming viruses. 

Mammalian genomes acquired both syncytin and Fvl genes as 
a result of the insertion of a retrovirus into the germline. Though 
such endogenous retroviruses have markedly impacted the 
mammalian genome, these elements have spent relatively little 
time coevolving with their host compared to the non-LTR 
retroposons, which date back at least to the origin of Metazoa 
[20]. The LINE-1 (Long INterspersed Element-1, LI) non-LTR 
retroposons make up a significant fraction of the human genome. 
Given their ancient history of coevolution with mammalian 
genomes, it is not surprising that there are numerous examples 
of exaptation of non-LTR retroelements into non-coding RNAs, 
as promoter or other regulatory elements, or as small portions of a 
coding region [4,6,21-24]. Many coding region exaptations are 



"exonizations", where cryptic splice sites within an intronic 
repetitive element are utilized so that a portion of the retroposon 
is incorporated as a novel exon of a host gene transcript, often as a 
minor splice isoform of the gene [25-28]. For example, a portion 
of an RTE non-LTR element was exapted as an additional coding 
exon of the existing ruminant bucentaur gene [29]. In contrast, 
only a single instance of a novel protein coding gene formed 
entirely anew from the domestication of a non-LTR retroposon 
has been described - L1TD1, or LINE-1 type Transposase 
Domain-containing 1 (L1TD1 has no known enzymatic activity 
despite its designation as a 'transposase' domain, see below). 
Though originally identified among a set of genes specifically 
expressed in murine embryonic stem cells, hence its original name 
ECAT11 (Embryonic Stem Cell Associated Transcript 11) [30], 
LI FD1 's origins and evolution remain poorly characterized. 

In both human and mouse, L1TD1 expression is high in 
undifferentiated stem cells and decreases precipitously upon 
differentiation [31]. In humans, L1TD1 is intricately linked with 
the cellular program for pluripotency maintenance, where it 
functions in a regulatory network with the core stem factors 
NANOG, SOX2, and OCT4. L1TD1 expression appears to be 
tightly controlled by these pluripotency factors, which all bind the 
L1FD1 promoter [32]. Depletion of L1FD1 in human ES cells 
results in an immediate downregulation of OCT4 (also known as 
POU5F1) and NANOG, and is sufficient to induce differentiation. 
In contrast, L1TD1 in mice is completely dispensable for the 
maintenance and induction of pluripotency [30,33]. These studies 
suggest that despite similarities in stem-cell specific expression, the 
function of L1TD1 in pluripotency maintenance may have 
changed since the common ancestor of rodents and primates. 

We wished to understand how a retroelement-derived gene 
could become incorporated into such an important cellular process 
in mammalian cells. To this end, we traced the evolutionary 
origins, history, and selective pressures of the L1TD1 gene in 
mammals. In contrast to the expectation that L1TD1 would be an 
essential gene if it had been domesticated for its function in 
mammalian pluripotency, we find several cases where LI TD1 has 
been lost as well as several cases of selection for genetic innovations 
in L1TD1. This leads us to propose an evolutionary transition 
model wherein L1TD1 was first retained for a role in genome 
defense in stem cells and germline (similar to Fvl). Subsequently, 
LI FD1 may have been exapted for an essential function (similar to 
syncytin) in pluripotency maintenance in a subset of eutherian 
mammals. 

Results 

L1TD1 originated in the common ancestor of placental 
mammals 

To gain insight into the cellular function of L1TD1 , we decided 
to date its evolutionary origin and examine how its gene structure 
has changed since its birth. Since the only published analysis 
examined five mammalian genomes for its presence [33], we 
decided to search for the L1TD1 gene in a much larger sample of 
diverse mammalian genomes. In the human and mouse genomes, 
L1TD1 is found between the single-copy INADL and KANK4 
genes (Figure 1A). We found that all mammals and even bird 
genomes encode INADL and KANK4 in close proximity to each 
other. We therefore used these flanking genes to identify the 
syntenic locus in other mammalian genomes, and used the 
sequence from this locus to identify L1TD1 coding sequences 
where present. We also extended our search to the rest of the 
genome, using repeat-masked and unmasked human and mouse 
LI FD1 sequences as BLAST search queries. We were never able 
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Figure 1. The mammalian L1TD1 gene was born from a tandem insertion of LI ORF1. A. Genomic context of L1TD1. Single-copy genes 
INADL and KANK4 flank L1TD1 in human, mouse and dog genomes; this shared syntenic arrangement helped us identify L1TD1 orthologs in other 
mammalian genomes. B. L1TD1 evolved according to the accepted species tree. We aligned L1TD1 nucleotide sequences and generated a maximum- 
likelihood phylogeny (see Materials and Methods). Bootstrap values show the percentage of 1000 replicates in which descendent taxa cluster 
together, and the scale bar shows substitutions per site according to the GTR+I+G evolutionary model. C. L1TD1 comprises two L1 -ORF1 p-like 
regions. L1 elements encode an approximately 6.5 kb transcript containing two open reading frames. L1 ORF1 encodes a protein, ORF1 p, with RNA- 
binding and chaperone activity. ORFIp contains a coiled-coil motif (CC), a RNA-recognition motif (RRM), and a C-terminal domain (CTD). ORF2 
encodes a protein with endonuclease and reverse transcription enzymatic functions. Sequence identity demonstrates that L1TD1 was formed from 
the domestication of two copies of ORF1 from L1. The two copies may derive from independent insertions or from duplication after a single insertion. 
Human coding exon 1 and coding exon 2 share 30% and 43% amino acid identity, respectively, with ORF1 p of human LI .3. In coding exon 1, only the 
CTD is conserved, while in coding exon 2, the CC, RRM, and CTD are all conserved. Coding exon 2 also contains a variable length glutamic acid-rich 
region (ER). After splicing, the human L1TD1 transcript is 3849 nucleotides in length and encodes a single 865 amino acid protein product. 
doi:1 0.1 371 /journal.pgen.1 004531. g001 



to identify an intact copy of L1TD1 outside the INADL-KANK4 
syntenic locus, although we did find some pseudogenes with 
obvious inactivating mutations in other locations. Although the 
previously published analysis [33] suggested that dog L1TD1 lacks 
a portion of coding exon 1 , in fact we find that LI TD1 is complete 
and intact in the dog genome (Table S2). 

We found L1TD1 in many diverse placental mammalian 
genomes (Table S2, Dataset SI). However, L1TD1 is absent from 
both marsupial and platypus genomes. This species distribution 
implies that the L1TD1 gene was born in the common ancestor of 
placental mammals, after the split from marsupials. A phylogenetic 
tree using L1TD1 nucleotide sequences follows the expected 
species tree (Figure IB). Together with the observed shared 
syntenic location, this tree demonstrates that the sequences we 
identified represent truly orthologous genes rather than several 
independent LI domestication events. This dates the origin of 
L1TD1 to at least 106 million years ago, making it more ancient 
than the well-known domesticated retroelement genes syncytin and 
Fvl [9,15]. 

Two L1 ORF1 -homologous regions constitute the L1TD1 
gene 

Human L1TD1 (RefSeq NM_001164835.1) comprises two 
protein-coding exons (of four total exons), which together encode 
an 865 amino acid protein. Each of these two exons is homologous 
to the first open reading frame (ORF1) of LI (Figure 1C), whose 
protein product (ORFIp) functions as an RNA-binding protein 
that gready enhances LI retrotransposition [34]. ORFIp appears 
to be important in ensuring LI 'cis-preference' - the preference for 
the ORF2 protein (ORF2p) to act upon the same RNA from 



which it was translated, so that an Li's machinery is less often 
'hijacked' by other elements or other Lis [34—36]. L1TD1 has no 
discernible homology to the other open reading frame of LI, 
ORF2, which encodes the enzymatic activities of LI. The second 
coding exon of L1TD1 has higher conservation with ORFIp, 
showing 43% amino acid identity (58% amino acid similarity) with 
human LI (LI. 3), and apparent preservation of the coiled-coil, 
RNA-recognition, and C-terminal motifs of Ll-ORFlp (Fig- 
ure 1C). In contrast, the first coding exon retains only 30% amino 
acid identity (36% similarity) with ORFIp (LI. 3), preserving the 
C-terminal domain but not the coiled-coil or RNA-recognition 
motif. This greater sequence identity in the second coding exon 
could either reflect a more recent origin from L1-ORF1 or greater 
constraint to preserve ancestral Ll-ORFlp like functions. Coding 
exon 2 also encodes a ~300 amino acid low-complexity glutamic 
acid-rich region that separates the two ORF1 p-like regions. The 
annotated mouse L1TD1 gene (RefSeq NMJ30 1 081202.1) is 
similar to human L1TD1 but includes an extra intron that 
removes ~200 nucleotides encoding ~70 amino acids of this low- 
complexity region. However, we find that the reading frame is 
maintained through this mouse intron, and EST data suggest that 
it is often retained in the mature mouse L1TD1 transcript. 
Overall, we find that L1TD1 has maintained conservation with 
ORFIp of LI despite its independent evolution for tens of millions 
of years; this suggests some aspect of LI ORFIp function may still 
be utilized in L1TD1. 

To discern the order of the LI domestication events that led to 
the origin of LI TD1 and to confirm our dating of its birth, we 
generated a phylogeny of the two L1TD1 ORF1 p-like amino acid 
sequences together with representative LI ORFIp sequences from 
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diverse mammalian genomes (Figure 2). We find that L1TD1 N- 
terminal regions (the protein product of coding exon 1) group 
together with strong bootstrap values (node A), as do L1TD1 C- 
terminal regions (the protein product of coding exon 2) (node B), 
demonstrating that the double ORFlp structure arose just once 
since the divergence of placental mammals and has not been 
subject to gene conversion between the two exons since. 
Furthermore, the L1TD1 N-terminal and C-terminal clades 
branch off from placental mammal LI -ORFlp sequences (node 
C) after marsupial and placental mammal LI -ORFlp sequences 
diverged (node D). This supports our conclusion that L1TD1 was 
born via LI domestication in the ancestral eutherian mammal. 
Our phylogeny cannot distinguish whether the two homologous 
regions of L1TD1 derived from two independent LI insertions in 
proximity to each other, or whether a single LI inserted and 
subsequently experienced tandem duplication. Examination of the 
relative splice acceptor positions in coding exons 1 and 2 also does 
not help to distinguish these two possibilities. Nevertheless, it does 
appear that both coding exons were born in close temporal and 
physical proximity to each other, giving rise to the ancestral 
LI TD1 gene. Our analyses further show that this bipartite double- 
ORFlp organization oiLlTDl has been conserved since its birth. 

Multiple, independent losses of L1TD1 in mammals 

Despite its widespread conservation, our genome-wide searches 
nevertheless revealed that L1TD1 has been lost on at least three 
separate occasions during the evolution of placental mammals 
(Figure 3). First, we find that L1TD1 is missing in all three 
Afrotherian genomes we surveyed - elephant, hyrax and tenrec. 
These three species constitute an approximately 85 million-year- 
old subclade. However, we find L1TD1 is intact and present in the 
armadillo genome, which diverged from the three sequenced 
members of Afrotheria nearly 100 million years ago. Using 
parsimony, we infer that L1TD1 was lost once in the ancestor of 
elephant, hyrax and tenrec between 84 and 100 million years ago. 

Second, we find that L1TD1 is missing from the genomes of all 
Cetartiodactyla. This clade of mammals originated 64 million 
years ago and comprises even-toed ungulates, whales, and 
dolphins. All sequenced genomes of this clade (cow, sheep, 
dolphin, pig and alpaca, Figure 3) lack LI TD1 . In contrast to its 
loss in Cetartiodactyla, L1TD1 is present in nearly all other 
members of the 80 million year old Laurasiatheria clade, which 
includes Cetartiodactyla. The most parsimonious explanation is 
that the common ancestor of Cetartiodactyla lost L1TD1 between 
64 and 80 million years ago. 

In both the Afrotheria and Cetartiodactyla, we find no trace of 
the L1TD1 gene anywhere in the genome, perhaps indicating loss 
of L1TD1 via genomic deletions of the whole gene. However, 
given that the Afrotheria and Cetartiodactyla loss events likely 
occurred at least 64 and 84 million years ago, it is equally likely 
that after pseudogenization the L1TD1 sequence has simply 
degenerated beyond recognition in these lineages. 

The third instance of loss oiLlTDl occurred in the genome of 
the megabat Pteropus vampyrus (Figure 3). The microbat Myotis 
lucifugus still encodes an intact LI TD1 , but P. vampyrus contains 
a pseudogenized version of L1TD1 with multiple frameshifts and 
stop codons. In contrast to the complete lack of a recognizable 
L1TD1 in the Afrotheria and Cetartiodactyla, L1TD1 is still 
discernible in the megabat, suggesting this loss may be more 
recent. 

These three loss events in mammals strongly challenge the 
possibility that L1TD1 was born into a role of pluripotency 
maintenance in the common ancestor of placental mammals, as 
we would expect such a gene to be essential and not subject to 
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Figure 2. Phylogenetic tree of representative LI -ORFlp 
sequences and mammalian orthologs of the two L1TD1 
ORFIp-like regions. Predicted protein sequences of L1 ORFlp and 
the ORF1 p-like regions encoded by each L1TD1 exon were aligned and 
this alignment was used to generate a maximum-likelihood tree (see 
Materials and Methods). Bootstrap values show percentage of 1000 
replicate trees in which the descendent taxa clustered together (only 
values >50% shown). The scale shows the number of substitutions per 
site. The tree was rooted using fish swimmer elements as outgroups. 
The N-terminal L1TD1 ORF1 p-like sequences cluster together with a 
high bootstrap value (node A), as do the C-terminal L1TD1 ORF1 p-like 
sequences (node B), confirming that we have identified true L1TD1 
orthologs and that the double ORFlp structure arose just once since 
the divergence of placental mammals and has not been subject to gene 
conversion between the two exons since. The L1TD1 N-terminal and C- 
terminal clades branch off from placental mammal L1 -ORFlp sequences 
(node C) after marsupial and placental mammal LI -ORFlp sequences 
diverged (node D). The tree does not help to distinguish whether L1 
ORF1 was independently domesticated twice, or just once with a 
subsequent genomic tandem duplication. Extensive sequence diver- 
gence between paralogous ORFIp-like sequence means that deep 
nodes of the tree are poorly resolved. Nonetheless, the tree supports 
our model in which both L1TD1 exons were born after marsupials and 
placental mammals diverged. 
doi:1 0.1 371/journal.pgen.1 004531. g002 



subsequent loss. It is formally possible that these lineages 
independendy acquired another LI TD1 -like activity that allowed 
the loss of the originally acquired L1TD1. However, it is more 
parsimonious that LlTDVs role in pluripotency maintenance was 
acquired much later than its birth. 

The loss of L1TD1 in mammals parallels the multiple losses of 
the Fvl restriction factor in the Mus genus [15]. We therefore 
considered whether, like Fvl, L1TD1 may have rapidly evolved 
under positive selection due to some role in a genetic conflict. 



PLOS Genetics | www.plosgenetics.org 



4 



September 2014 | Volume 10 | Issue 9 | e1004531 



Origin and Evolution of L1TD1 in Mammals 



INADL 




human 
gorilla 
macaque 
marmoset 
aye-aye 
bushbaby 
mouse 
rat 

Chinese hamster 
guinea pig 
rabbit 
pika 
cow ' 
sheep 
dolphin 
pig 



INADL : 



panda 

megabat -^^-^ 
microbat 
elephant 
hyrax 
tenrec 
armadillo 
opossum 
wallaby 

Tasmanian devil 
platypus 



Figure 3. L1TD1 has been lost multiple times in eutherian 
mammals. A species tree shows the presence or absence of L1TD1 
across mammals. Arrows depict LITDVs genomic locus; black (INADL) 
and white {KANK4) arrows depict the flanking genes we used to identify 
syntenic regions, and the blue/green arrow depicts L1TD1. LITDVs 
presence in the armadillo genome but not in platypus, opossum, 
wallaby or Tasmanian devil indicates it was most likely born before the 
divergence of placental mammals, but after divergence from marsupials 
(solid branches). L1TD1 function was lost in three lineages (X's, and gray 
branches); it is present as a pseudogene in megabat and entirely 
missing from Afrotherian and Cetartiodactylan genomes. The bushbaby 
L1TD1 gene acquired a novel N-terminal region (depicted in red) 
through a more recent L1 ORF1 p domestication event (red asterisk) that 
occurred after bushbaby diverged from lemurs (Figure 4B). 
doi:1 0.1 371 /journal.pgen.1 004531. g003 

Genetic innovation of primate and mouse L1TD1 
supports a 'genetic conflict' hypothesis 

If L1TD1 were indeed playing a role in genome defense against 
some pathogen, one hallmark of the ensuing conflict might be the 
rapid evolution of L1TD1 coding sequence, a signature commonly 
seen at many host-virus interaction interfaces [37]. Such rapid 
change would be expected if the coevolving pathogen constantly 
evolved to evade L1TD1 recognition, in which case L1TD1 would 
be expected to rapidly evolve to 're-establish' recognition of the 
rogue element. Indeed, it has been found that primate LI ORFlp 
experienced an episode of adaptive evolution [38], consistent with 
the idea that LI could be evolving to escape some sort of genome 
defense factor. 

Due to the finding of positive selection on primate LI ORFlp 
[38] and the availability of multiple primate genome assemblies 
and DNA samples, we first focused on characterizing signatures of 
evolutionary selection in L1TD1 of primates. Through database 
searches and PCR-based sequencing of genomic DNA samples, we 
assembled 18 L1TD1 sequences spanning the simian primate 
phylogeny, representing more than 40 million years of evolution 
(Figure 4A). Consistent with LITDVs important role in pluripo- 
tency maintenance in human ES cells, we found no instances of 
LI TD1 loss or pseudogenization in primates. Overall, we found a 
high degree of conservation in L1TD1 in primates, with average 
dN/dS of ~0.5 over the primate phylogeny. dN/dS is a 
normalized ratio indicating whether amino acid-altering 



evolutionary changes occur more often than expected given the 
rate of neutral mutations, with values < 1 indicating overall 
conservation, and values > 1 indicating overall positive selection. 

Though we observe an overall signature of purifying selection in 
L1TD1, there could nonetheless be signatures of diversifying 
selection occurring on just a few domains or residues. To assess 
whether positive selection has acted on primate L1TD1, we used 
maximum likelihood methods (NSsites models in the PAML suite 
[39]) to test for positive selection at individual codons. We found 
that a model permitting positive selection in the alignment fit the 
primate data significantly better than models that disallow positive 
selection (MB vs. M7, M8 vs. M8a, p<0.01). Both coding exons 
show evidence of positive selection, suggesting that both domes- 
ticated ORFlp-like regions have been recurrendy selected for 
functional novelty. Further, the positive selection is remarkably 
localized. Only a few L1TD1 codons (~1%) show a signature of 
positive selection, but these codons have a high average dN/dS (> 
8, Table 1). Such highly localized signatures of recurrent positive 
selection often represent direct contacts with antagonistic entities 
(discussed below). We note there is no overlap between the 
positively selected sites in primate L1TD1 and the regions 
previously found in LI ORFlp, which are largely in the coiled- 
coil domain [38,40]. FUBAR, carried out in the HyPhy suite of 
programs confirmed our finding of highly localized positive 
selection (Materials and Methods, Table 1) (Murrell, 2013 #). 

Since we hypothesize that the ancestral function of L1TD1 may 
have been genome defense, we expect that L1TD1 should have 
evolved under positive selection within many branches of the 
mammalian phylogeny. To generalize our finding of pervasive 
positive selection in the primates, we analyzed the evolution of 
L1TD1 within the genus Mus. We assembled 10 complete L1TD1 
sequences from databases, PCR-based sequencing, and RNA-seq 
data. Again, all species contained an intact coding sequence with a 
high degree of conservation (average dN/dS = 0.46). However, 
similar to the primate analysis, PAML NSsites found a small 
proportion of sites with a high average dN/dS (~3%, dN/dS > 
12), as well as a highly significant gene-wide signature of positive 
selection (Figure 4A; M8 vs. M7, M8 vs. M8a, p<0.01). We also 
found statistical support using FUBAR for 8/12 positions 
identified as positively selected by PAML (Table 1; PAML M7 
vs M8 BEB, P>0.9; FUBAR P>0.9). To eliminate any false 
positive selection signals that could arise from phylogenetic 
discordance, we used the HyPhy program CARD to identify 
potential recombination breakpoints in the Mus L1TD1 alignment 
[41]. The alternative phylogenies given by CARD attempt to 
correct for any recombination that may have occurred and given 
rise to a scenario where no single tree accurately fits the entire 
L1TD1 sequence. While CARD found no statistically significant 
breakpoints according to the KH test, we nonetheless tested 
whether Mus L1TD1 retained a statistically significant signature of 
positive selection integrating the generated alternative phylogenies 
(Figure SI). Using the GARD-generated trees, we still observed a 
strong signature of gene-wide positive selection according to 
PAML (p<0.01) and FUBAR still identified 7 positively selected 
positions. 

In addition to phylogeny-wide selection at a few specific codons, 
branch-specific analyses of dN/dS ratios highlighted episodic 
positive selection along a few specific branches of the primate 
phylogeny including the branch leading to the common ancestor 
of spider monkey and titi monkey, as well as the branch leading to 
the OWMs and hominoids (Figure 4A, red asterisks, Branch-site 
REL, p<0.05) [41]. Using PAML, the branch leading to the 
common ancestor of gorilla, human and chimpanzees (after the 
split from orangutan) showed a whole gene dN/dS of 2.87 (1 
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Figure 4. Novelty in L 1TD1 of primates and mice. A. Site-specific PAML analyses reveal a signature of positive selection in L1TD1. The labels on 
the annotated schematic indicate positions that are highly likely to be evolving under positive selection (P>90%) according to PAML NSsites 
(Table 1) in primates and mice (above and below the gene diagram, respectively). A species tree of primates or an L1TD1 gene tree of mice, shows 
branches with statistically significant episodic diversifying selection (p<0.05) according to HyPhy's Branch-site REL (marked with a red asterisk). To 
the right of each tree, the amino acids found at each positively selected position are shown, along with the length of the glutamic acid-rich region in 
each primate. Position numberings are based upon the human and M. musculus (C57/BL6) sequences. B. The L1TD1 gene of bushbaby has acquired a 
novel 5' end of coding exon 1 through the insertion of a portion of a LI element from the L1PA15-16 class (shown in red). The gene retains high 
sequence conservation with L1TD1 of lemurs and simian primates across the latter half of its first coding exon and all of its second coding exon 
(shown in blue and green). This insertion is unique amongst all the species we have examined, and is not evident in lemur genomes (mouse lemur or 
aye-aye). Elsewhere in the genome, bushbaby contains at least two complete and two partial processed L1TD1 pseudogenes that allowed us to infer 
the structure of the active L1TD1 gene. 
doi:1 0.1 371 /journal.pgen.1 004531. g004 



synonymous change and 1 1 nonsynonymous changes); the branch 
preceding it shows a whole gene dN/dS of 1.671 (1 synonymous 
change, 5 nonsynonymous changes). While neither of these ratios 
is significantly greater than one, it is notable that these two 
branches span a time window ~9-20 Mya, overlapping the ~40- 
12 Mya time period in the lineage leading to humans shown to 
exhibit positive selection in L1-ORF1 and of particularly intense 
LI activity [38]. 

Our finding that LI TD1 has evolved under positive selection in 
primates and mice would be unexpected if its sole function was in 
pluripotency; we would expect such genes to be highly conserved. 
We excluded an intriguing alternative possibility that genes 
involved in pluripotency might not be evolving as slowly as one 
would intuitively assume. In a genome-wide analysis of dN/ dS 
values calculated from trios of human-chimpanzee-macaque 
orthologs [42], genes identified in a screen for determinants of 
ES cell identity [43] are indeed evolving more slowly than control 
genes (Wilcoxon p = 0.008; Figure S2). Thus, our finding of 
positive selection in L1TD1 is indeed unexpected if its only role 
were in pluripotency maintenance. 

We cannot evaluate the extent to which positive selection has 
shaped protein regions that have experienced length-changing 
insertion-deletion changes, because they are unsuitable for codon- 
based analyses of positive selection. However, we note significant 
divergence in the glutamic-acid rich region of L1TD1 that 
separates the two LI ORFlp homology regions. For instance, New 
World monkeys have a deletion of — 140 amino acids at the 
beginning of L1TD1 coding exon 2 relative to the Old World 
monkeys and hominoids. The functional significance as well as the 
selective pressures that might have driven these changes is 
unknown; it is quite possible that these changes are completely 
neutral and have little impact on L1TD1 function. 

We also discovered a substantial restructuring of the L1TD1 
gene of bushbaby (Otolemur garnettii) that may have a significant 
impact on its function. In the bushbaby genome, we found that the 
latter two-thirds of the L1TD1 coding region are completely 
typical and align well with orthologous sequences from other 
primates and mammals, with no stop codons or frameshifts. 
However, the first one-third of bushbaby L1TD1 is not 
orthologous to any other L1TD1 (Figure 4B). Instead, it appears 
the bushbaby L1TD1 has acquired an entirely novel 5' end. Two 
processed L1TD1 pseudogenes in the bushbaby genome also 
include this novel 5' end. Since processed pseudogenes lack 
introns, they provide independent confirmation of the transcript 
structure of L1TD1. On closer examination, the novel 1400 
nucleotides at the beginning of the bushbaby L1TD1 includes 
~360 nucleotides of protein-coding sequence from ORFlp of a LI 
element of the LI PA 15- 16 class, a primate-specific LI element. 
The remaining ~ 1000-nucleotide 5' UTR region comprises a 
patchwork of several other repetitive elements, and likely arose by 
a series of nested insertion events. Although the repetitive element 
portions of this 5' UTR are found in several other loci in the 



bushbaby genome, the only places they are found in this particular 
combination are at the L1TD1 syntenic locus and in the L1TD1 
processed pseudogenes (Figure 4B). Thus, it appears that bush- 
baby L1TD1 has undergone a recent remodeling, replenishing its 
N-terminal region with a more current version of a LI element 
than the version captured in the original domestication event 
— 100 million years ago. The newly replenished version of 
bushbaby L1TD1 has evolved under purifying selection (Figure 
S3), ruling out the alternate possibility that this L1TD1 
rearrangement is a pseudogenization event. This novel L1TD1 
structure is not evident in the genomes of two lemur species, the 
gray mouse lemur (Microcebus murinus) and the aye-aye 
(Daubentonia madagascariensis), which both encode a more 
typical L1TD1, and we have not observed evidence of bush- 
baby-like L1TD1 restructuring in the any other mammalian 
genomes examined (Figure 3). This implies that the bushbaby 
LI TD1 remodeling occurred after the split between the lemur and 
bushbaby lineages, nearly 60 million years ago. Closer examina- 
tion of other prosimian lineages will allow a more precise dating of 
this event. 

From these forms of variation in the primate and mouse 
lineages, we conclude that genetic innovation has been adaptively 
selected for in L1TD1 through a number of mechanisms including 
mutation of individual amino acids and perhaps expansions and 
contractions of the glutamic acid-rich region. Importantly, the 
fixation of a new gene structure in the bushbaby derived from a 
recently active LI suggests selection for functional novelty through 
yet another LI -derived sequence. The signature of positive 
selection in both mice and primates suggests L1TD1 may be 
coevolving with some pathogen. Intriguingly, a strong signature of 
positive selection, indicative of such 'arms-races', has been 
previously seen in primate LI evolution, with an especially striking 
signature in ORFlp [38,40]. Combined with its origin as a 
domesticated LI gene and the novelty in bushbaby (which further 
emphasizes the selection pressure to maintain the L 1 -like character 
of L1TD1), we considered whether L1TD1 could be engaged in a 
genetic conflict with L 1 or some other retroelement. This function 
could parallel the known role of another domesticated retro- 
element gene, Fvl, in defense against elements similar to its 
progenitor [13,14]. To more clearly understand this hypothetical 
relationship, we reasoned that genes that function solely in LI 
restriction might be lost in species where active Lis have 
previously been shown to have gone extinct - the megabats and 
the rice rats. 

L1TD1 and the loss of L1 activity in megabat genomes 

Previous studies have demonstrated that a number of megabat 
species lack active LI elements. Indeed, it is likely that Lis 
experienced an extinction event in the megabat ancestor between 
24 and 58 million years ago [44]. We decided to explore the 
potentially antagonistic relationship between L1TD1 and LI by 
looking in bat species where the most detailed studies of LI 
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Intact Active 
L1TD1 Lis 

Dobsonia moluccensis + 

Pteropus hypomelanus 
Pteropus vampyrus 
Pteropus alecto 
Hypsignathus monstrosus 
Megaloglossus woermanni 

Rousettus amplexicaudatus - 

Rhinolophus ferrumequinum + + 

Rhinolophus eloquens +* + 

Myotis lucifugus + + 

Myotis brandtii + + 

Myotis davidii - + 

Eptesicus fuscus + + 

A. melanoleuca (panda) + + 

C. familiaris (dog) + + 



Figure 5. Loss of L1TD1 in megabats appears to follow the loss 
of LI activity. We obtained L1TD1 sequences from thirteen bat 
species (Materials and Methods). We show a species tree partly based 
upon a published megabat phylogeny [82], inferring placement of 
additional taxa using species from the same genus. In addition, we used 
our L1TD1 sequences to resolve relationships in the Pteropus and 
Myotis/Eptesicus clades. Species in which L1TD1 appears intact are 
shown in black, and those in which L1TD1 harbors inactivating 
mutations (stop codons, frameshifting insertions/deletions) are shown 
in gray. Some species share the same inactivating mutation(s) 
represented by the subscripts of the gray X symbols, suggesting 
L1TD1 was lost three independent times in the megabats. For one 
species, R. eloquens (starred), we were only able to obtain part of coding 
exon 1 of L1TD1, but this region is intact. Presence or absence of active 
Lis is based upon previous data [44] and analysis of the M. davidii 
genome assembly (Yang and Wichman, unpublished data). 
doi:1 0.1 371 /journal.pgen.1 004531. g005 

extinction have been completed. In addition to identifying L1TD1 
sequences from seven available chiropteran genome sequences, we 
amplified and sequenced the L1TD1 coding exons from six 
additional bat species (Figure 5). We were able to unambiguously 
determine the complete coding sequence of L1TD1 for twelve bat 
species; in the thirteenth species, Rhinolophus eloquens, we were 
only able to obtain part of coding exon 1 of L1TD1. 

Bats (Chiroptera) are divided into the monophyletic megabat 
clade Megachiroptera and a polyphyletic microbat clade, the 
Microchiroptera. Among the Microchiroptera, our survey re- 
vealed yet another case of an independent L1TD1 loss, in the 
genome of the microbat Myotis davidii (Figure 5). In this genome, 
our bioinformatics analysis (Yang and Wichman, unpublished 
data) shows that the most recently inserted Lis have ~99.5% 
identity to each other and retain their open reading frame; further, 
these highly similar sequences maintain conservation of amino 
acids known to be completely conserved across young mammalian 
Lis [45], suggesting Lis are still active in the M. davidii lineage. 
This situation parallels that of Cetartiodactyla and Afrotheria, 
which have both lost L1TD1 but likely maintain active Lis. 

In contrast, among the megabats, we found that most species 
that have lost active Lis have also lost LI TD1 . The only exception 
was the megabat species Dobsonia moluccensis, which has lost 
active Lis but possesses an apparendy intact L1TD1 that has 
evolved under purifying selection for at least some of the time since 
it diverged from other megabats (Figure S4). Our observation of 
unique inactivating coding mutations among subsets of the 
megabat species suggests there were at least three independent 
losses of L1TD1 (Figure 5, X's with subscripts). However, due to 
the poor resolution of the phylogenetic relationships of the other 



six megabat species with Dobsonia moluccensis (Figure 5), we 
cannot formally rule out the alternate possibility that the initial loss 
of L1TD1 function occurred once, via a non-coding mutation. 
Under both scenarios, we infer that the extinction of Lis preceded 
loss of L1TD1 in the megabats [44]. 

Thus, it appears that loss of LI may have led to loss oiLlTDl 
in megabats except in Dobsonia moluccensis (see Discussion). This 
evolutionary relationship between LI elements and L1TD1 is 
consistent with the possibility that L1TD1 was domesticated by 
mammalian genomes to antagonize Lis, similar to the domesti- 
cation of Fvl for defense against retroviruses. Under this model, 
loss of active Lis would relieve the selective pressure to maintain 
L1TD1 in megabat genomes. 

L1TD1 and the loss of L1 activity in rice rat genomes 

In addition to the loss of active Lis in the megabats, a group of 
rice rats in the subfamily Sigmodontinae are the only other group 
of mammals in which a loss of active Lis has been well 
characterized [45]. To investigate whether these species might 
exhibit a correlation similar to that of the megabats, in which L 1 
loss and L1TD1 loss co-occur, we sequenced the L1TD1 gene 
from ten LI -less Sigmodontinae species (5 complete sequences, 4 
coding exon Is, 1 coding exon 2), as well as two Sigmodontinae 
species that retain active Lis (Oecomys bicolor and Reithrodontomys 
fulvescens). In all cases, we found an intact coding sequence, 
suggesting these species have retained a functional copy of the 
L1TD1 gene (Dataset SI). 

We suggest two hypotheses for the apparent retention of Li TD1 
in these species, despite the absence of active Lis. First, it is 
possible that not enough time has passed for inactivating 
mutations to appear in the coding sequence of L1TD1 in these 
species; we note that the loss of active Lis in these species is much 
more recent than the loss of Lis in megabats (7.2-12.3 million 
years ago in the rodents vs 24—58 million years ago in the 
megabats). While we cannot definitively rule out the possibility 
that L1TD1 is neutrally evolving within these species, overall the 
tree is better fit by a single purifying dN/dS than by a dN/ dS fixed 
at 1 (p<0.001). Further, most of the Sigmodontinae sequences 
exhibit pairwise dN/ dS ratios that suggest purifying selection 
within the clade (Table SI). We therefore favor the second 
possibility that retention of a functional L1TD1 in these species 
results from selection pressures independent of Lis. For example, 
L1TD1 could play an essential role in pluripotency (despite the 
dispensable role of the Mus L1TD1). Intriguingly, despite having 
an overall tree length similar to our Mus alignment, we do not find 
any evidence of positive selection in the L1TD1 genes from 
Sigmodontinae using PAML NSsites on either full sequences (M7 
vs M8, p = 0.33), coding exon 1 alone (p = 0.53), or coding exon 2 
alone (p = 0.84). Alternatively, L1TD1 in sigmodonts could play a 
role in genome defense against another element active in these 
genomes. Indeed, the sigmodontine rodents have a highly active 
family of ERVs, the mysTR elements [46]. Expansion of this 
family preceded the death of Lis, but these elements are very 
active, with 3500 to 7000 species-specific insertions in the Ll- 
extinct species examined [47]. This recent ERV amplification in 
Sigmodontinae contrasts with the megabats (where L1TD1 has 
been lost in many species); there are apparently no highly active 
DNA or RNA elements in megabats [48]. IfLlTDl can suppress 
retroelements other than Lis, this could explain why the gene is 
retained in sigmodontine rodents but not in megabats. 

While these results do not provide additional support for our 
proposed correlation between LI loss and L1TD1 loss, they do 
provide another example of the potentially stochastic co-option 
of L1TD1 into essential cellular processes that has driven the 
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variable gain and loss of this gene throughout mammalian 
evolution. 

Discussion 

Domesticated genes such as LI TD1 provide a window into the 
opportunistic means by which host genomes can exapt new genetic 
functions from their resident mobile elements. LI elements and 
non-autonomous SINE elements that rely on LI for mobility 
comprise a substantial fraction of the human genome (at least 1 7 
and 30 percent, respectively [49]). This abundance, together with 
a relatively long history of vertical transmission has provided 
ample opportunities for host genome domestication of LI 
elements. Indeed, we see evidence of this exaptation for enhancer 
functions, in exonization of Alu elements and small LI fragments, 
and even in long non-coding RNAs [4,6,21-23,25-28]. However, 
L1TD1 is unique in being the only known host protein-coding 
gene whose protein coding sequence is derived almost entirely 
from the coding sequence of an LI. In addition, its exclusive 
expression in stem cells and the germline and the observation that 
L1TD1 appears to be required for pluripotency maintenance in 
human ES cells [31], make it a powerful example of retroelement 
domestication for an essential function. Exaptation for an 
important function such as pluripotency maintenance draws 
immediate parallels to other retroelement domestications such as 
the retroviral envelope-derived syncytin genes and the sushi- 
retroelement derived PeglO-relaXed genes, both of which serve 
essential roles in mammalian placentation [7,11,12]. Indeed, at 
first glance, our estimate of the evolutionary age of LI TD1 (only 
slightly younger than Peg 10 and older than any identified syncytin 
gene) does suggest long-term preservation for some important 
function. 

Nonetheless, our elucidation of LlTDVs evolutionary history 
challenges these parallels. For instance, we found multiple 
instances of L1TD1 loss or pseudogenization at different stages 
of mammalian evolution, in contrast to PeglO, which is intact in 
all examined mammals [1 1,50,51]. Furthermore, PeglO is subject 
to strong purifying selection (unpublished data), presumably due to 
the constraint imposed by its essential functions in placentation 
and genomic imprinting. In contrast, we have found evidence of 
diversifying selection acting on L1TD1 in primates and mice, 
along with genetic innovation in bushbaby via a partial 
replacement of L1TD1 coding sequence with the ORF1 of a 
younger LI element. This pattern of genetic innovation and 
sporadic loss is more reminiscent of another class of domesticated 
retroelement genes, exemplified by the Fvl retroviral gag-derived 
gene in Mus species, which serves as a form of host defense against 
a broad array of incoming retroviral capsids [15,18]. 

It is noteworthy that germline and early embryonic tissues 
represent the primary batdeground for the conflict between hosts 
and LI elements. In order to propagate new copies to future 
generations, Lis must successfully retrotranspose in these cells. 
However, these are the very compartments where a host could 
incur the greatest fitness cost from LI replication. The highly 
specific expression pattern of LI TD1 suggests that its target could 
be genomic retroelements that mobilize during germline develop- 
ment and early embryonic stages. 

Three findings lead us to consider the intriguing speculation 
that L1TD1 may target LI elements themselves (although other 
functions for L1TD1 are quite possible). First, we find that 
extinction of LI activity appears to have been followed by L1TD1 
loss in several megabat lineages. This relationship between LI loss 
and L1TD1 loss is far from perfect. For example, in the microbat 
Myotis davidii and other mammalian genomes (Afrotheria and 



Cetartiodactyla), L1TD1 loss is not accompanied by LI extinction. 
This is easily explained, because a number of different restriction 
factors exist to defend against LI in mammalian genomes [52], 
and in particular genomes, the previously beneficial function of 
L1TD1 may become redundant and dispensable. Also, if a rapidly 
evolving LI were to escape restriction by L1TD1, control over this 
rogue element could be regained through adaptive mutations in 
restriction factors besides L1TD1, thereby relaxing selective 
pressure to maintain L1TD1. Alternatively, LI retrotransposition 
rates are known to have varied over evolutionary time: periods 
where LI retroposition rates are low could also lead to relaxed 
selective pressures on L1TD1 retention. A similar relaxation of 
selective pressure has been invoked to explain the idiosyncratic loss 
of Fvl in Mus species [15]. Second, we found that bushbaby 
L1TD1 underwent a partial replacement with a more current 
version of LI ORF1. Although such 'replenishment' could 
represent a neutral event unique to the bushbaby, the fact that it 
has occurred suggests a model in which the 'newer' version of 
LI TBI was fixed in the bushbaby genome because it conferred a 
selective advantage over the pre-existing version, which itself 
formed from a much more ancient LI domestication. Third, our 
observation that L1TD1 has experienced positive selection in 
primates (and mice) is consistent with previous findings that LI 
ORF1 has evolved under positive selection during primate 
evolution [38,40]. Like Fvl action against distandy homologous 
retroviral capsids, the observed positive selection suggests the 
possibility that L1TD1 may be in direct conflict with LI ORFlp 
from active LI elements. Since LI elements also provide the 
machinery for non-autonomous elements like SINEs [53] and 
likely HAL Is [54], it is also possible that these elements represent 
the true targets of L1TD1, rather than LI itself. 

L1TD1 could function as a LI restriction factor through a 
number of possible mechanisms. First, analogously to Fvl, L1TD1 
might interfere with the homotrimerization of LI -ORFlp that is 
necessary for LI retrotransposition. This trimerization normally 
happens through ORFlp's coiled-coil domain [55]. We note that 
LI TBI coding exon 2 has preserved its coiled-coil domain, which 
may be necessary for this interaction. Under this model, L1TD1 
could act as a dominant negative version of ORFlp. Its binding to 
ORFlp could lead to LI restriction, which could then drive 
episodes of diversifying selection in LI ORFlp to escape L1TD1 
restriction [40]. In response to these escape mutations in LI, we 
might expect L1TD1 to rapidly evolve to restrict the variant LI - a 
situation consistent with the positive selection we observe in 
L1TD1. Another possibility is that L1TD1 could outcompete LI 
ORFlp for binding to the ORF2p protein, the likely rate-limiting 
component for retrotransposition. Alternatively, LlTDl's RNA- 
binding activity could be the source of its restrictive abilities. In 
this model, L1TD1 could act as a competitive inhibitor of LI 
ORFlp, binding the RNA of transcribed retroelements and 
blocking the downstream pathway that would normally create new 
copies. This could function through simple titration or through a 
more complex mechanism whereby L1TD1 localizes LI RNA to 
P-bodies [32] for sequestration or decapping and subsequent 
degradation. The previously reported association of L1TD1 with 
the TRIM28 (KAP1) complex [56] suggests yet another possible 
mechanism of restriction. TRIM28 is known to function in the 
silencing of exogenous and endogenous retroviral integrations by 
recruitment of heterochromatin proteins to some retroviral 
elements [57]. Under this epigenetic silencing model, L1TD1 
may function in an analogous role to the zinc finger adaptor 
proteins that help target TRIM28 and consequently the silencing 
complex to particular genomic sequences. Thus, constant adap- 
tation to bind either DNA or RNA at LI integration sites could 
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drive diversifying selection of L1TD1 . It is intriguing to imagine a 
model wherein the silencing machinery could target any genomic 
parasite through a modular adapter protein that recognizes the 
newly integrated sequence (although it must be noted that such 
activities have not been ascribed to LI ORFlp from which 
L1TD1 derived). Further characterization will shed light into 
whether L1TD1 encodes any or all of these biochemical activities. 

When we examined the genome sequences of bats, we found 
that many species that have lost active L 1 s have also lost LI TD1 . 
However, Dobsonia moluccensis still maintains LI TD1 as an intact 
protein-coding gene, despite the fact that this species lost active 
Lis many million years previously. Similarly, we found L1TD1 
intact in the subset of the Sigmodontinae rodents previously shown 
to have lost active Lis [45]. This implies that the proposed 
selection for restricting active LI elements cannot be the only 
constraint that dictates retention of L1TD1. We hypothesize that 
idiosyncratic, highly divergent LI insertion patterns in different 
mammalian genomes could provide an explanation of both the 
retention of L1TD1 in mammals without an active LI and the 
essential role in pluripotency maintenance that L1TD1 plays in 
humans. Under this model, LlTDVs role as either important or 
dispensable for pluripotency maintenance depends on the pattern 
of where Lis have inserted in the genome of each species. In 
various contexts, transposable elements have been shown to affect 
the expression of genes in their vicinity [58,59], and some elements 
are included in the untranslated regions of host mRNAs [60,61]. If 
L1TD1 transcriptionally or post-transcriptionally silences LI 
elements, this silencing could alter the expression of host genes 
near Lis, or with Lis in their UTRs. Under this model, in some 
mammalian genomes, LI insertions would not be in proximity to 
genes that impact the pluripotency program (Figure S5). In these 
species, extinction of LI would result in the loss of constraint on 
L1TD1 (e.g., megabat P. vampyrus), and experimental knock- 
down of LI TD1 would have no effect on pluripotency mainte- 
nance (e.g., M. musculus [33]). In contrast, in other mammalian 
genomes where LI insertions occurred near genes whose 
repression is important for pluripotency maintenance, loss of 
L1TD1 would lead to loss of pluripotency (e.g., H. sapiens [31]). 
In fact, repression could have been incorporated into any number 
of functionally important pathways in the cells where L1TD1 is 
expressed, simply depending on which genes Lis landed next to 
during evolution. We hypothesize that in such species, extinction 
of LI activity would not relax the selective pressures to maintain 
L1TD1 (e.g., megabat D. moluccensis, Sigmodontine rice rats), 
which would now be required to recognize 'dead' LI copies in 
order to maintain these repressive programs. 

Although we present the hypothesis that L1TD1 was originally 
domesticated as an anti-Ll restriction factor and was subsequently 
recruited for pluripotency regulation in humans by virtue of direct 
repression of LI elements near functionally relevant genes, this idea is 
at present highly speculative. L1TD1 has experienced a long 
evolutionary history functioning in the context of other factors 
expressed in pluripotent cells, and it is possible that L1TD1 could 
have been coopted by these factors. For instance, L1TD1 could be 
involved in regulating RNA involved in pluripotency via its ancestral 
chaperone function, or in regulating important protein complexes via 
protein-protein interactions. We look forward to experimental 
investigation of all of these hypotheses; none of them can completely 
explain all of our evolutionary observations (L1TD1 loss and 
retention in various lineages, positive selection and remodeling in 
bushbaby), perhaps suggesting that LlTDl's function has changed 
multiple times over the course of mammalian evolution. 

In summary, we posit that LlTDVs original function could 
have been in genome defense (similar to Fvl), and that it still has 



defense functions in many mammalian species where it is retained 
(including humans). We note that our evidence for L1TD1 
diversifying selection (and inferred genome defense) comes in part 
from the primate lineage, which is also where its role in 
pluripotency maintenance is best established. Later in evolution, 
by virtue of either its transcriptional or post-transcriptional 
silencing of LI, L1TD1 may have become intricately enmeshed 
into the transcriptional program of LI -proximal genes in some 
species, meaning that LI TD1 also acquired an absolutely essential 
function (similar to Peg 10 and syncytin). An interesting feature of 
this model is that LlTDl's essential function (or lack thereof) is 
thus a consequence of stochastic L 1 insertion patterns in different 
mammalian genomes. This stochasticity could provide an expla- 
nation for why LI TD1 has been lost in some species, and why it is 
responsible for pluripotency maintenance in only some mamma- 
lian genomes. 

Materials and Methods 

Ethics statement 

The animals used in this study are wild-derived laboratory 
animals of the species Mus musculus, Mus spretus, Mus spicilegus, 
Mus mattheyi and Apodemus uralensis. None of these species are 
protected. Permits for catching the founding members of each line 
were not required at the time they were caught. Some specimens 
were caught on the properties of private landowners, with their 
oral permission to enter the property and catch mice. All animal 
work was carried out by experienced personnel at the Max Planck 
Institute for Evolutionary Biology, following the legal requirements 
in accordance with German animal welfare law (Tierschutzgesetz) 
and FELASA guidelines. Permits for keeping mice were obtained 
from the local veterinary office "Veterinaramt Kreis Plon" (permit 
number: 1401-144/PLO-004697). The dissection of animals and 
organ extractions were performed according to the German 
Animal Welfare Act § 8a Abs. 1 Nr. 3b TierSchG; 'Organ-/ 
Gewebsentnahme zu wissenschaftlichen/diagnostischen Zwecken 
TierSchG' V 312-72241.123-34. 

Biological materials 

Sigmodontine rodent liver tissue was obtained on loan from the 
Museum at Texas Tech University (Lubbock, TX). 

For Mus samples, mice of different ages were sacrificed by C0 2 
asphyxiation followed by cervical dislocation. Mice were then 
dissected and tissues were snap-frozen within 5 minutes post- 
mortem. Liver (front view: front left lobe), both testes, and whole 
brain including brain stem were collected. For the outbred 
populations, Iran (AH), France (M), and Germany (CB) for Mus 
musculus domesticus, and Austria (WI) and Kazakhstan (KH) for 
Mus musculus musculus, eight individuals each were sampled. For 
inbred groups, Mus musculus castaneus (TA), Mus spretus (SP), 
Mus spicilegus (SC), Mus mattheyi (MA) and Apodemus uralensis 
(AP), four individuals each were sampled. All mice were obtained 
from the mouse collection at the Max Planck Institute for 
Evolutionary Biology. 

Mus transcriptome sequencing, processing, and 
mapping/assembly 

The sampled tissues of each Mus group were used for RNA 
extraction with the RNAeasy Kit (Qiagen) and pooled at 
equimolar concentrations. Quality of the RNA was measured 
with BioAnalyzer chips (Agilent), for the individual samples and 
pools, and samples with RIN values below 7.5 were discarded. 
The pools were subsequently submitted to the Cologne Center for 
Genomics (CCG) for further processing and sequencing. The 
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sequencing of the samples was performed using a polyA tail 
purification step, followed by cDNA synthesis, Illumina library 
preparation, and sequencing with an Illumina HiSeq 2000 
sequencer. Each transcriptome sample was sequenced in approx- 
imately one third of a HiSeq2000 flow-cell lane (one flow-cell lane 
per taxon). 

All raw data files were trimmed for adaptors and quality using 
Trimmomatic [62]. The quality trimming was performed base- 
wise, removing bases below quality score of 20 (Q20), and keeping 
reads whose average quality was of at least Q30. Reads whose 
trimmed length was shorter than 40 bases were excluded from 
further analyses, and pairs missing one member because of poor 
quality were also removed from any further analyses. 

Quality-filtered transcriptome reads were aligned against the 
mmlO version of the mouse reference genome from UCSC [63] 
using NextGenMap [64]. Reads which were ambiguously or 
poorly mapped (MAPQ <20) were removed from the analyses. 
Quality-filtered transcriptome reads for each taxon were merged 
into a single input file, discarding tissue information, and 
assembled de novo with the Trinity platform [65] using default 
parameters. 

Sequence collection 

L1TD1 sequences were obtained from publicly available 
primate genome databases using PSI-BLAST [66] against the 
NR database or TBLASTN [66] against the HTGS database, with 
human L1TD1 as a search seed (Table S2). For sequencing of 
other primate, Sigmodontinae, and Mus species (Table S2), exon 1 
and exon 2 were PCR amplified individually from genomic DNA 
using oligonucleotides designed against intronic regions. PCR 
primers were designed against intronic regions neighboring each 
exon based upon an alignment of mouse, rat, and Chinese hamster 
for rodents or human, rhesus, and squirrel monkey for primates 
(Primates Exonl Sense: CAGAATCCAGTCTTGACAACATA- 
TCC; Primates Exonl Antisense: CAGGAGAATCACTTGAAC- 
CTGGG; Primates Exon2 Sense: GTCAGAATGGAAGCCA- 
TATTAAAATTAGTG; Primates Exon2 Antisense: GCTATTA- 
GCTGTCCATCCTTCTGG; Rodent Exonl Sense: GYAAG- 
WAMAYTTTC ATTTGYTTATAKTTC ; Rodent Exonl Anti- 
sense: CCYATCARTYTCTRGAACYCCYRTCAARC; Rodent 
Exon2 Sense: GGMAAGYATACTAAATTYAGAGGGTRA- 
AATAG; Rodent Exon2 Antisense: AASTCAACCAACMYK- 
CAGRKAGTK). PCR products were sequenced using standard 
Sanger sequencing. 

For Mus sequences, informative reads were obtained with 
samtools [67] from alignments overlapping with the L1TD1 
annotation. Known sequences from the coding sequences of 
L1TD1 were used to identify the most similar assembled contigs in 
each taxon using nucleotide-nucleotide blast [68] (e-value <le- 
10). 

L1TD1 sequences from primate, Sigmodontinae, and Mus 
species have been deposited in Genbank under accession numbers 
KJ994281-KJ994329. 

Alignments and positive selection analysis 

Primate and Mus nucleotide sequences were aligned using the 
ClustalW 'translation align' function in Geneious Pro (Biomatters 
Ltd.). Alignments were refined manually, including truncation of 
the poorly aligned glutamic-acid rich region, and this alignment 
and an established primate phylogeny [69] or Mus LI TD1 gene 
tree were input into the CODEML sites model of PAML [70] to 
detect positive selection at individual sites. Positively selected sites 
were classified as those sites with a M8 Bayes empirical Bayes 
posterior probability >90%. FUBAR was performed using the 



web-based implementation of HyPhy (www.datamonkey.org) 
[71,72]. To test for signatures of positive selection along individual 
branches of the primate phylogeny, we used Branch-site REL in 
HyPhy [41] or the branch model of PAML. For PAML, the 
statistical significance of any branch that showed dN/ dS > 1 in the 
free ratio model (model = 1) was tested using a two-ratio tree 
(model = 2) by specifying the branch of interest as a foreground 
branch and all other branches as background branches. We then 
compared the likelihood of a model where the foreground branch 
had a freely estimated dN/dS with the likelihood of a model where 
that branch had dN/dS fixed at the neutral value of 1 [39]. 

For the analysis shown in Figure S2, we obtained dN/ dS values 
from a genome-wide analysis of trios of human-chimpanzee- 
macaque orthologs [42] , made available via Adam Siepel's website 
(http://compgen.bscb.cornell.edu/orthologs). To be conservative, 
we filtered out dN/dS values that might be artificially high for 
technical reasons: we eliminated values calculated from alignments 
of fewer than 100 codons, eliminated alignments that contained 
fewer than 5 evolutionary changes of any type, and included only 
RefSeq genes and not other genes derived from less confident 
annotation sets. In this dataset, alignments with no synonymous 
changes had apparent dN/dS values of 999; again, to be 
conservative, we replaced dN/dS values for these genes with a 
more conservative estimate of dN/dS = 2. We converted sequence 
identifiers to gene symbols using Bioconductor [73], allowing us to 
cross-reference dN/dS values with genes identified in a screen for 
ES cell determinants [43]. For the plot shown in Figure S2, we 
selected a list of 127 ES cell determinants validated by secondary 
screening using three markers of stem cell identity [43]; other 
overlapping lists of validated hits from this screen show similar 
evolutionary patterns. 

Mammalian sequence collection and alignment 

To identify previously unannotated copies of L1TD1 in 
sequenced genomes, the genomic sequence between flanking 
genes INADL and KANK4 was extracted and compared with the 
corresponding human genomic region (INADL-L1TD1-KANK4) 
using Dotter [74]. Visual inspection of Dotter output enabled 
extraction of L1TD1 protein-coding sequences. Multispecies 
LI TD1 alignments were generated using CLUSTALW [75] with 
manual adjustment. For all mammals mentioned in the text, 
genome sequences were searched for non-syntenic and/or 
pseudogene copies of L1TD1 using TBLASTN with a Repeat- 
Masked version [48] of human and mouse L1TD1 sequences, and 
in some cases L1TD1 sequence from a more closely related 
species. 

To identify representative LI ORFlp sequences (Dataset S2) 
used to build the tree shown in Figure 2, we first selected a small 
number of LI ORFlp consensus sequences from RepBase [76] 
and obtained the sequence of an active human LI from Genbank 
(LRE2, accession AAB60344.1). We then used TBLASTN [66] 
against whole genome assemblies (Table S2) to identify a single 
intact LI -ORFlp sequence from the genome assemblies of each 
species shown in Figure 2, selecting a copy arbitrarily from among 
the blast hits that did not show inactivating mutations in ORFlp. 

Phylogenetic inference 

To construct the tree shown in Figure IB, we used our 
alignment of L1TD1 nucleotide sequences and the jModelTest 2 
program [77] to determine that the best-fitting evolutionary model 
for this alignment is the GTR model with invariant sites and 
gamma distributed rates (GTR+I+G). We then generated a 
maximum-likelihood phylogeny using PhyML [78] and the 
GTR+I+G evolutionary model with four site categories for the 
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gamma distribution. 1000 replicate trees were constructed, and the 
tree with the highest log-likelihood was chosen and displayed using 
MEGA5 [79]. 

To construct the tree shown in Figure 2, predicted protein 
sequences were aligned by hand (Dataset S2). We used ProtTest 3 
[80] to determine that the best-fitting evolutionary model for this 
alignment is the JTT model with gamma-distributed rates (JTT+ 
G). We then used PHYML to generate maximum-likelihood trees 
with the JTT+G model (four site categories for the gamma 
distribution). 1000 replicate trees were constructed, and the tree 
with the highest log-likelihood was chosen and displayed using 
MEGA5 [79]. Bootstrap values represent the percentage of trees in 
which the descendent taxa cluster together. 

Dates of divergence mentioned in the text were generated using 
the TimeTree web service [81]. 

Bat sequencing and database sequence collection 

We obtained L1TD1 sequences for Pteropus alecto, Pteropus 
vampyrus, Rhinolophus ferrumequinum, Myotis lucifugus, Myotis 
davidii, Myotis brandtii and Eptesicus fuscus from publicly 
available genome sequences. For other bat species, genomic 
DNA samples were obtained from tissues from The Museum, 
Texas Tech University. Sample accession numbers were previ- 
ously published [44]. Degenerate PCR primers were designed 
against intronic regions neighboring each exon based upon an 
alignment of P. vampyrus and M. lucifugus sequences (Exonl 
Sense : TTTC AGATGATTTTKTAAAWWGAKTTTRGGGG; 
Exonl Antisense: TYYTMWYWAWTWAAMARSTGTTAAS- 
YYWTTSTTC; Exon2 Sense: TGGGGWTCCMAGCCTTYA- 
AGAAMAAATC; Exon2 Antisense: CATCMCCAAGTATA- 
CTGTTAGCTGTCCATC). For Rousettus amplexicaudatus 
and Rhinolophus eloquens we designed a second set of primers 
based upon the new sequences we generated to amplify Exonl 
(Sense: AAATATCACCCACATGGAAAGAATTAG; Antisense: 
TTCTCTTGAATCCCATATCTTCTTCC). PCR products 
were sequenced by Sanger sequencing. For Nyctimene albiventer, 
Megaerops niphanae and Cynopterus sphinx, we were unable to 
amplify or sequence any product from either exon with multiple 
PCR optimizations of annealing temperature, amount of genomic 
DNA template and concentration of magnesium ion. For two bat 
species, it appeared that more than one closely related L1TD1 
sequence was present. Upon cloning and sequencing multiple 
PCR products, we found that M. woermanni had 4 sequence 
variants and R. amplexicaudatus had 2 variants. For each species, 
at least one inactivating mutation was shared by all variants. 

Supporting Information 

Dataset SI In-frame nucleotide alignment of L1TD1 sequences. 
The fasta-formatted alignment includes all L1TD1 sequences 
described in the manuscript. The inactivating mutations in L1TD1 
that are shared by subsets of megabat species (Figure 5) are at the 
following nucleotide positions of the alignment: P. hypomelanus, 
P. alecto and P. vampyrus share three mutations: a stop codon at 
bp 34-36; a stop codon at bp 106-108; a 1 bp insertion at bp 
~774. H. monstrosus and M. woermanni share a different set of 
three mutations: a stop codon at bp 385-387; a 1 bp deletion at bp 
720; a 1 bp deletion at bp 939. R. amplexicaudatus does not share 
any of the above mutations, but has numerous lineage-specific 
inactivating mutations. 
(TXT) 

Dataset S2 Amino acid alignment of Ll-ORFlp and the two 
ORFlp-homologous regions of L1TD1. The fasta-formatted 



alignment was used to generate the phylogenetic tree shown in 
Figure 2. 

(FA) 

Figure SI Positive selection in Mus L1TD1 is robust to 
alternative phylogenies from potential recombination breakpoints. 
We used HyPhy's GARD program to test for recombination 
breakpoints in the Mus L1TD1 alignment that could give rise to 
false signatures of positive selection. We found three potential 
breakpoints, though none was statistically significant (KH test, p> 
0.1). To ensure that our detection of positive selection was robust 
to the use of these alternative phylogenies, we performed PAML 
NSsites on slices of the alignment corresponding to each of the 
GARD trees, and we used the built-in functionality of DataMon- 
key to use these GARD-generated trees to identify positively- 
selected positions using FUBAR. Top, the whole gene tree is 
shown in black. The three breakpoints are shown on the schematic 
of the L1TD1 gene with vertical lines, and the slices they delimit 
are shown in different colors. Below each gene segment is shown 
the GARD-generated tree that best describes that region. Both the 
PAML NSsites signature of selection and FUBAR-identified 
selected sites are robust to the use of these alternative phylogenies. 
(PDF) 

Figure S2 Determinants of embryonic stem cell identity evolve 
more slowly than control genes. The boxplots show the 
distribution of dN/dS values for genes identified in a screen for 
determinants of human ES cell identity [43] and for all other genes 
in a genome-wide dataset of dN/dS values calculated from trios of 
orthologs from human, chimpanzee and macaque genomes [42]. 
Oudiers with dN/ dS> 1 are omitted; gray dots represent other 
oudiers - the large number of datapoints for "other genes" 
precludes visualization of individual datapoints. The stem cell 
determinants are evolving more slowly than other genes (Wilcoxon 
p = 0.008). LlTDVs dN/dS value in this genome-wide dataset is 
shown using a red horizontal line; it is evolving faster than most 
other pluripotency genes. Although L1TD1 did not meet the 
arbitrary threshold (Fav score < — 2) used to identify stem cell 
determinants in the published RNAi screen, its score in the screen 
(Fav = —0.90) is well below the genome-wide average, consistent 
with previous results that human L1TD1 is a pluripotency factor 
[31,32]. 
(PDF) 

Figure S3 Intact LI TD1 from bushbaby Ololemur garnettii has 
evolved under purifying selection. We used codeml's free-ratio 
model to estimate selective pressures on LI TD1 on each branch of 
the two trees shown. Above each branch we show estimated dN/ 
dS ratios, and in parentheses below each branch we show the 
estimated number of non-synonymous and synonymous changes, 
respectively. L1TD1 pseudogenes are shown with their labels in 
gray. For three selected branches we performed likelihood tests of 
whether the estimated dN/dS ratio is a significantly better fit to 
the data than dN/ dS = 1 for that branch, by assuming only two 
dN/dS ratios for the entire tree (one ratio for the branch in 
question, and one for all other branches), and comparing a model 
where dN/dS for the branch in question was fixed at 1 with a 
model where dN/dS was freely estimated. Results of those tests are 
shown with red superscripts. A. We examined evolution of the full- 
length ORF of the novel bushbaby L1TD1 structure, comparing it 
to two processed pseudogenes in the bushbaby genome that arose 
after this novel L1TD1 structure formed. The intact bushbaby 
L1TD1 is more likely evolving under purifying than neutral 
selection (p — 0.0003; **). B. We examined the portion of 
bushbaby L1TD1 that aligns to the ancestral L1TD1 gene, 
including human and aye-aye L1TD1 genes as outgroups. The 
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intact bushbaby L1TD1 is more likely evolving under purifying 
than neutral selection (p<10~ 5 ; **); there is also weaker support 
(p = 0.07; *) for purifying selection on the shared ancestor of the 
intact bushbaby L1TD1 and bushbaby processed pseudogene B. 
(PDF) 

Figure S4 Intact L1TD1 from megabat Dobsonia moluccensis has 
evolved under purifying selection. We used codeml's free-ratio model 
to estimate selective pressures on L1TD1 on each branch of the species 
tree shown. Above each branch we show estimated dN/ dS ratios, and 
in parentheses below each branch we show the estimated number of 
non-synonymous and synonymous changes, respectively. Species in 
which LI TBI is a pseudogene are shown with their labels in gray. For 
three branches, we performed likelihood tests of whether the estimated 
dN/ dS ratio is a significantly better fit to the data than dN/ dS = 1 for 
that branch, by assuming only two dN/ dS ratios for the entire tree (one 
ratio for the branch in question, and one for all other branches), and 
comparing a model where dN/ dS for the branch in question was fixed 
at 1 with a model where dN/dS was freely estimated. Results of those 
tests are shown with red superscripts: the Dobsonia moluccensis branch 
is more likely evolving under purifying than neutral selection 
(p = 0.003), as is the branch ancestral to D. moluccensis and the three 
Pteropus species (p = 0.002). In contrast, for the Pkropus alecto branch, 
a neutral model is as good a fit to the data as a model invoking 
purifying selection (denoted by n.s., for non-significant). 
(PDF) 

Figure S5 LI TD1 could become essential based upon the pattern 
of LI insertions in a specific genome. We present a scheme under 
which L1TD1 could idiosyncratically adopt a pluripotency role in a 
species-specific fashion. A. We schematize the genomes of three 
exemplar species with an identical stretch of seven genes (rounded 
boxes), one of which must be silenced for pluripotency maintenance 
(orange boxes). B. Lis (black rounded boxes) insert randomly into 
each genome. One insertion (species 3) happens to be near the 
silenced gene. C. According to our hypothesis, L1TD1 silences the 
newly inserted Lis (grayed areas), as well as genes near LI 
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insertions. In the case of species 3, the gene that must be silenced is 
now silenced redundantly by the ancestral mechanism and L1TD 1 . 
Because of this redundancy, either LI TD1 or the ancestral silencing 
could be lost. D. Loss of the ancestral silencing mechanism would 
render L1TD1 essential. A loss of L1TD1 in this case (species 3) 
would result in the expression of the orange gene, which must 
remain silenced for survival. In this way, L1TD1 could be co-opted 
as an essential regulator of pluripotency. If L1TD1 instead targets 
Lis near arbitrary genes with no influence on the pluripotent state 
of a cell (species 1 and 2), loss of LI TD1 would not affect the 
maintenance of pluripotency; in this case, LI TD1 would be retained 
only if its restriction or other functions were beneficial. 
(PDF) 

Table SI Maximum likelihood estimates of pairwise dN/dS for 

L1TD1 from Sigmodontinae rodents. 

(XLSX) 

Table S2 L1TD1 Genbank accessions or coordinates in 

mammalian genome assemblies searched. 

(XLSX) 
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